Cluster based Chinese abbreviation modeling

Authors

  • Yangyang Shi
  • Yi-Cheng Pan
  • Mei-Yuh Hwang
Abstract

Abbreviations are widely used in spoken Chinese. Automatically generating Chinese abbreviations helps improve Chinese natural language understanding systems and Chinese search engines. We treat abbreviation generation as a character-based tagging problem. Because training data are limited, Chinese abbreviation generation suffers from data sparseness. We propose two types of strategies to reduce its impact. First, in addition to a traditional sequence labelling method, Conditional Random Fields (CRF), we apply a Recurrent Neural Network with Maximum Entropy Extension (RNNME) [9], which performs similarly to CRF in our experiments. Second, we use training data clustering and latent topic modeling in abbreviation generation. Clustering or topic modeling not only addresses data sparseness but also exploits the fact that full names from the same cluster or the same latent topic have similar abbreviation patterns. Our experimental results show that manual clustering yields a relative 8% improvement in abbreviation generation accuracy, and that latent topics obtained from Latent Dirichlet Allocation (LDA) yield a relative 10% improvement.
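To make the character-based tagging formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-label scheme (`K` for a kept character, `S` for a skipped one) and assumes the abbreviation is an in-order subsequence of the full name; the label names, the function names, and the example pair 北京大学 / 北大 are illustrative assumptions that do not come from the paper.

```python
from typing import List


def tag_characters(full_name: str, abbreviation: str) -> List[str]:
    """Label each character of full_name as 'K' (kept) or 'S' (skipped).

    Hypothetical helper: uses a greedy left-to-right alignment and assumes
    the abbreviation is an in-order subsequence of the full name.
    """
    labels = []
    j = 0  # index of the next abbreviation character still to be matched
    for ch in full_name:
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels.append("K")
            j += 1
        else:
            labels.append("S")
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full name")
    return labels


def apply_labels(full_name: str, labels: List[str]) -> str:
    """Rebuild an abbreviation from per-character labels."""
    return "".join(ch for ch, lab in zip(full_name, labels) if lab == "K")


# Illustrative pair (an assumption, not taken from the paper's data).
labels = tag_characters("北京大学", "北大")
print(labels)
print(apply_labels("北京大学", labels))
```

Label sequences of this kind are what a sequence labeller such as a CRF would be trained to predict for an unseen full name. The greedy alignment picks the leftmost match when a character repeats, which may not be the intended one, and abbreviations that reorder or substitute characters would need a richer label set.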


Similar resources

Chinese Abbreviation Identification Using Abbreviation-Template

Chinese abbreviations are frequently used without being defined, which has brought much difficulty into NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To meet our aim of identi...


Vocabulary expansion through automatic abbreviation generation for Chinese voice search

Long named entities are often abbreviated in oral Chinese language, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. The generation of Chinese abbreviations is much more complex than English abbreviations, most of which are acronyms and truncations. In this paper, we propose a new method for automatically generating abbreviations for Chinese named en...


Mining Atomic Chinese Abbreviation Pairs with a Probabilistic Single Character Word Recovery Model

An HMM-based Single Character Recovery (SCR) Model is proposed in this paper to extract a large set of “atomic abbreviation pairs” from a text corpus. By an “atomic abbreviation pair,” it refers to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaust...


Mining atomic Chinese abbreviations with a probabilistic single character recovery model

An HMM-based single character recovery (SCR) model is proposed in this paper to extract a large set of atomic abbreviations and their full forms from a text corpus. By an “atomic abbreviation,” it refers to an abbreviated word consisting of a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaustively but the abbreviation process for compound...


A Chinese Dataset with Negative Full Forms for General Abbreviation Prediction

Abbreviation is a common phenomenon across languages, especially in Chinese. In most cases, if an expression can be abbreviated, its abbreviation is used more often than its fully expanded forms, since people tend to convey information in a most concise way. For various language processing tasks, abbreviation is an obstacle to improving the performance, as the textual form of an abbreviation do...




Publication date: 2014